Abstract

From as early as 6 months of age, human children distinguish motion patterns
generated by animate objects from patterns generated by moving inanimate
objects, even when the only stimulus that the child observes is a single point
of light moving against a blank background. The mechanisms by which the
animate/inanimate distinction is made are unknown, but have been shown to rely
only upon the spatial and temporal properties of the movement. In this paper,
I present both a multi-agent architecture that performs this classification
and detailed comparisons of the individual agent contributions against human
baselines.

1  Introduction 

One of the most basic visual skills is the ability to distinguish animate from
inanimate objects. We can easily distinguish the movement of a clock pendulum
that swings back and forth on the wall from the movement of a mouse running
back and forth across the floor. Michotte [1962] first documented that adults
have a natural tendency to describe the movement of animate objects in terms
of intent and desire, while the movements of inanimate objects are described
in terms of the physical forces that act upon them and the physical laws that
govern them. Furthermore, by using only single moving points of light on a
blank background, Michotte showed that these perceptions can be guided by even
simple visual motion without any additional context.

Leslie [1982] proposed that this distinction between animate and inanimate
objects reflects a fundamental difference in how we reason about the causal
properties of objects. According to Leslie, people effortlessly classify
stimuli into three different categories based on the types of causal
explanations that can be applied to those objects, and different modules in
the brain have evolved to deal with each of these types of causation.
Inanimate objects are described in terms of mechanical agency, that is, they
can be explained by the rules of mechanics, and are processed by a
special-purpose reasoning engine called the Theory of Body module (ToBY) which
encapsulates the organism’s intuitive knowledge about how objects move. This
knowledge may not match the actual physical laws that govern the movement of
objects, but rather is our intuitive understanding of physics. Animate objects
are described either by their actions or by their attitudes, and are processed
by the Theory of Mind module which has sometimes been called an “intuitive
psychology.” System 1 of the theory of mind module (ToMM-1) explains events in
terms of the intent and goals of agents, that is, their actions. For example,
if you see me approaching a glass of water you might assume that I want the
water because I am thirsty. System 2 of the theory of mind module (ToMM-2)
explains events in terms of the attitudes and beliefs of agents. If you see me
approaching a glass of kerosene and lifting it to my lips, you might guess
that I believe that the kerosene is actually water. Leslie further proposed
that this sensitivity to the spatio-temporal properties of events is innate,
but more recent work from Cohen and Amsel [1998] may show that it develops
extremely rapidly in the first few months and is fully developed by 6-7
months.

Although many researchers have attempted to document the time course of the
emergence of this skill, little effort has gone into identifying the
mechanisms by which an adult or an infant performs this classification. This
paper investigates a number of simple visual strategies that attempt to
classify animate and inanimate stimuli based only on spatio-temporal
properties without additional context. These strategies have been implemented
on a humanoid robot called Cog as part of an on-going effort to establish
basic social skills and to provide mechanisms for social learning
[Scassellati, 2000]. A set of basic visual feature detectors and a
context-sensitive attention system (described in section 2) select a sequence
of visual targets (see Figure 1). The visual targets in each frame are linked
together temporally to form spatio-temporal trajectories (section 3). These
trajectories are then processed by a multi-agent representation that mimics
Leslie’s ToBY module by attempting to describe trajectories in terms of naive
physical laws (section 4). The results of the implemented system on real-world
environments are introduced, and a comparison against human performance on
describing identical data is discussed in section 5.

2  Visual  Precursors 

Cog’s visual system has been designed to mimic aspects of an infant’s visual
system. Human infants show a preference for stimuli that exhibit certain
low-level feature properties. For example, a four-month-old infant is more
likely to look at a moving object than a static one, or a face-like object
than one that has similar, but jumbled, features [Fagan, 1976]. Cog’s
perceptual system combines many low-level feature detectors that are
ecologically relevant to an infant. Three of these features are used in this
work: color saliency analysis, motion detection, and skin color detection.
These low-level features are then filtered through an attentional mechanism
which determines the most salient objects in each camera frame.

Figure 1: Overall architecture for distinguishing animate from inanimate
stimuli. Visual input is processed by a set of simple feature detectors, each
of which contributes to a visual attention process. Salient objects in each
frame are linked together to form spatio-temporal trajectories, which are then
classified by the “theory of body” (ToBY) module.

2.1  Pre-attentive  visual  routines 

The color saturation filter is computed using an opponent-process model that
identifies saturated areas of red, green, blue, and yellow [Itti et al.,
1998]. The color channels of the incoming video stream (r, g, and b) are
normalized by the luminance l and transformed into four color-opponency
channels (r', g', b', and y'). The four opponent-color channels are
thresholded and smoothed to produce the output color saliency feature map.

In parallel with the color saliency computations, the motion detection module
uses temporal differencing and region growing to obtain bounding boxes of
moving objects. The incoming image is converted to grayscale and placed into a
ring of frame buffers. A raw motion map is computed by passing the absolute
difference between consecutive images through a threshold function T:

$$M_{raw} = T(\lVert I_t - I_{t-1} \rVert) \tag{5}$$

This raw motion map is then smoothed to minimize point noise sources.
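The temporal differencing step in equation (5) can be sketched as follows. This is only an illustrative sketch under stated assumptions: images are plain nested lists of grayscale values, the threshold value of 15 is hypothetical, and the neighbor-count smoothing rule stands in for whatever smoothing the original system used (the text does not specify it). The function names are invented for this example.

```python
def raw_motion_map(prev, curr, threshold=15):
    """Binary map: 1 where |curr - prev| exceeds the (assumed) threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def smooth(binary, min_neighbors=2):
    """Suppress point noise: keep a set pixel only if at least
    `min_neighbors` of its 8 neighbors are also set (assumed rule)."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            # Count set pixels in the 3x3 window, excluding the center.
            neighbors = sum(
                binary[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            ) - 1
            if neighbors >= min_neighbors:
                out[y][x] = 1
    return out
```

A moving 2x2 patch survives the smoothing pass, while a single isolated changed pixel is discarded as noise.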

The third pre-attentive feature detector identifies regions that have color
values that are within the range of skin tones [Breazeal et al., 2000].
Incoming images are first filtered by a mask that identifies candidate areas
as those that satisfy the following criteria on the red, green, and blue pixel
components:

$$2g > r > 1.1g, \qquad 2b > r > 0.9b, \qquad 250 > r > 20 \tag{6}$$

The final weighting of each region is determined by a learned classification
function that was trained on hand-classified image regions. The output is
again median filtered with a small support area to minimize noise.
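The candidate mask of equation (6) translates directly into a per-pixel test. The sketch below covers only that mask; the learned classification function and the median filter are not reproduced. The function names and the representation of an image as nested lists of (r, g, b) tuples are assumptions made for this example.

```python
def is_skin_candidate(r, g, b):
    """Per-pixel test from equation (6) on red, green, and blue values."""
    return (2 * g > r > 1.1 * g) and (2 * b > r > 0.9 * b) and (250 > r > 20)


def skin_candidate_mask(image):
    """Apply the test to an image given as rows of (r, g, b) tuples."""
    return [[1 if is_skin_candidate(*pixel) else 0 for pixel in row] for row in image]
```

For instance, the pixel (200, 150, 120) satisfies all three inequalities, while a neutral gray (100, 100, 100) fails the requirement that red exceed 1.1 times green.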

2.2  Visual  attention 

Low-level perceptual inputs are combined with high-level influences from
motivations and habituation effects by the attention system. This system is
based upon models of adult human visual search and attention [Wolfe, 1994],
and has been reported previously [Breazeal and Scassellati, 1999]. The
attention process constructs a linear combination of the input feature
detectors and a time-decayed Gaussian field which represents habituation
effects. Areas of high activation in this composite generate a saccade to that
location and a compensatory neck movement. The weights of the feature
detectors can be influenced by the motivational and emotional state of the
robot to preferentially bias certain stimuli. For example, if the robot is
searching for a playmate, the weight of the skin detector can be increased to
cause the robot to show a preference for attending to faces. The output of the
attention system is a labeled set of targets for each camera frame that
indicate the positions (and feature properties) of the k most salient targets.
For the experiments presented here, k = 5.
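A highly simplified sketch of the weighted combination is given below. It makes several assumptions that are not in the text: feature maps are same-sized nested lists, the habituation field is supplied as a precomputed map that is subtracted from the total, and targets are reported as the k highest-valued cells rather than as segmented regions. All names are hypothetical.

```python
def combine_feature_maps(feature_maps, weights, habituation=None):
    """Weighted sum of named feature maps, minus an optional habituation map."""
    first = next(iter(feature_maps.values()))
    height, width = len(first), len(first[0])
    total = [[0.0] * width for _ in range(height)]
    for name, fmap in feature_maps.items():
        for y in range(height):
            for x in range(width):
                total[y][x] += weights[name] * fmap[y][x]
    if habituation is not None:
        # Assumption: habituation lowers the activation of attended locations.
        for y in range(height):
            for x in range(width):
                total[y][x] -= habituation[y][x]
    return total


def top_k_locations(total, k=5):
    """Return the (x, y) positions of the k most active cells."""
    cells = [(value, x, y) for y, row in enumerate(total) for x, value in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    return [(x, y) for _, x, y in cells[:k]]
```

Raising the weight of one map (for example the skin map) shifts the top-ranked locations toward regions where that feature is strong, which mirrors the playmate example in the text.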

3  Computing  Motion  Trajectories 

The attention system indicates the most salient objects at each time step, but
does not give any indication of the temporal properties of those objects.
Trajectories are formed using the multiple hypothesis tracking algorithm
proposed by Reid [1979] and implemented by Cox and Hingorani [1996]. The
centroids of the attention targets form a stream of target locations
$\{P_t^1, P_t^2, \ldots, P_t^k\}$ with a maximum of k targets present in each
frame t. The objective is to produce a labeled trajectory which consists of a
set of points, at most one from each frame, which identify a single object in
the world as it moves through the field of view:

$$T = \{P_1^{i_1}, P_2^{i_2}, \ldots, P_n^{i_n}\} \tag{7}$$

However, because the existence of a target from one frame to the next is
uncertain, we must introduce a mechanism to compensate for objects that enter
and leave the field of view and to compensate for irregularities in the
earlier processing modules. To address these problems, we introduce phantom
points that have undefined locations within the image plane but which can be
used to complete trajectories for objects that enter, exit, or are occluded
within the visual field. As each new point is introduced, a set of hypotheses
linking that point to prior trajectories is generated. These hypotheses
include representations for false alarms, non-detection events, extensions of
prior trajectories, and beginnings of new trajectories. The set of all
hypotheses is pruned at each time step based on statistical models of the
system noise levels and based on the similarity between detected targets. This
similarity measurement is based on similarities of object features such as
color content, size, and visual moments. At any point, the system maintains a
small set of overlapping hypotheses so that future data may be used to
disambiguate the scene. Of course, at any time step, the system can also
produce the set of non-overlapping hypotheses that are statistically most
likely. Figure 2 shows the last frame of a 30-frame sequence in which a chair
was pushed across the floor and the five trajectories that were located.

Figure 2: The last frame of a 30-frame sequence with five trajectories
identified. Four nearly stationary trajectories were found (one on the
person’s head, one on the person’s hand, one on the couch in the background,
and one on the door in the background). The final trajectory resulted from the
chair being pushed across the floor.

4  The  Theory  of  Body  Module 

To implement the variety of naive physical laws encompassed by the Theory of
Body module, a simple agent-based approach was chosen. Each agent represents
knowledge of a single theory about the behavior of inanimate physical objects.
For every trajectory $t$, each agent $a$ computes both an animacy vote
$\alpha_{ta}$ and a certainty $\rho_{ta}$. The animacy votes range from +1
(indicating animacy) to -1 (indicating inanimacy), and the certainties range
from 1 to 0. For these initial tests, five agents were constructed: an
insufficient data agent, a static object agent, a straight line agent, an
acceleration sign change agent, and an energy agent. These agents were chosen
to handle simple, common motion trajectories observed in natural environments,
and do not represent a complete set. Most notably missing is an agent to
represent collisions, both elastic and inelastic.

At each time step, all current trajectories receive a current animacy vote
$V_t$. Three different voting algorithms were tested to produce the final vote
$V_t$ for each trajectory $t$. The first voting method was a simple
winner-take-all vote in which the winner was declared to be the agent with the
greatest absolute value of the product:
$V_t = \max_a \lVert \alpha_{ta} \times \rho_{ta} \rVert$. The second method
was an average of all of the individual vote products:
$V_t = \frac{1}{A} \sum_a (\alpha_{ta} \times \rho_{ta})$, where $A$ is the
number of agents voting. The third method was a weighted average of the
products of the certainties and the animacy votes:
$V_t = \frac{1}{A} \sum_a (w_a \times \alpha_{ta} \times \rho_{ta})$, where
$w_a$ is the weight for agent $a$. Weights were empirically chosen to maximize
performance under normal, multi-object conditions in natural environments and
were kept constant throughout this experiment at 1.0 for all agents except the
static object agent, which had a weight of 2.0. The animacy vote at each time
step is averaged with a time-decaying weight function to produce a sustained
animacy measurement.
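The three voting schemes can be sketched as below, with each agent's output represented as an (animacy vote, certainty) pair. The function names are hypothetical. One interpretive assumption is flagged in the code: for winner-take-all, the sketch returns the signed product of the winning agent so that the direction of the vote is preserved, whereas the formula in the text is written in terms of the absolute value.

```python
def winner_take_all(votes):
    """Signed vote-times-certainty product of the agent whose product
    has the greatest absolute value (sign kept by assumption)."""
    return max((vote * certainty for vote, certainty in votes), key=abs)


def average_vote(votes):
    """Unweighted mean of the vote-times-certainty products."""
    return sum(vote * certainty for vote, certainty in votes) / len(votes)


def weighted_average_vote(votes, weights):
    """Mean of the products, each scaled by its agent's weight."""
    total = sum(w * vote * certainty for (vote, certainty), w in zip(votes, weights))
    return total / len(votes)
```

With three agents reporting (-1.0, 0.5), (1.0, 0.25), and (0.0, 1.0), the winner-take-all result is -0.5; doubling the weight of the first agent, as was done for the static object agent, moves the weighted average further toward inanimacy than the unweighted average.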

4.1  Insufficient  Data  Agent 

The purpose of the insufficient data agent is to quickly eliminate
trajectories that contain too few data points to properly compute statistical
information against the noise background. Any trajectory with fewer than
one-twentieth the maximum trajectory length or fewer than three data points is
given an animacy vote $\alpha = 0.0$ with a certainty value of 1.0. In
practice, maximum trajectory lengths of 60-120 were used (corresponding to
trajectories spanning 2-4 seconds), so any trajectory of fewer than 3-6 data
points was rejected.
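A minimal sketch of this rule follows. The function name is hypothetical, and returning `None` when the trajectory is long enough (meaning the agent abstains) is an assumption, since the text does not say what this agent reports in that case.

```python
def insufficient_data_agent(trajectory, max_length):
    """Return (animacy vote, certainty) = (0.0, 1.0) for trajectories that
    are too short; return None (assumed abstention) otherwise."""
    # Too short means fewer than 3 points or fewer than max_length / 20 points.
    min_points = max(3, max_length / 20)
    if len(trajectory) < min_points:
        return (0.0, 1.0)
    return None
```

With a maximum length of 60 the cutoff is 3 points, and with a maximum length of 120 it is 6 points, matching the 3-6 range quoted above.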

4.2  Static  Object  Agent 

Because the attention system still generates target points for objects that
are stationary, there must be an agent that can classify objects that are not
moving as inanimate. The static object agent rejects any trajectory that has
an accumulated translation below a threshold value as inanimate. The certainty
of the measurement is inversely proportional to the translated distance and is
proportional to the length of the trajectory.
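One way this agent might be written is sketched below. The threshold of 10 pixels is hypothetical, and the certainty formula is only a bounded stand-in: it decreases linearly with the translated distance and grows with trajectory length, rather than implementing a literal inverse proportionality, because the text gives no constants. Abstaining with `None` above the threshold is likewise an assumption.

```python
import math


def accumulated_translation(points):
    """Sum of Euclidean distances between consecutive (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def static_object_agent(points, max_length, threshold=10.0):
    """Vote inanimate (-1.0) when the total translation is below threshold."""
    distance = accumulated_translation(points)
    if distance >= threshold:
        return None  # assumed: the agent abstains for moving objects
    # Assumed certainty: higher for less movement and for longer trajectories.
    certainty = (1.0 - distance / threshold) * (len(points) / max_length)
    return (-1.0, min(1.0, certainty))
```

A full-length trajectory that never moves receives the maximum certainty of 1.0, while a short or slightly drifting trajectory receives a weaker inanimate vote.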

4.3  Straight  Line  Agent 

The straight line agent looks for constant, sustained velocities. This agent
computes the deviations of the velocity profile from the average velocity
vector. If the sum of these deviations falls below a threshold, as would
result from a straight linear movement, then the agent casts a vote for
inanimacy. Below this threshold, the certainty is inversely proportional to
the sum of the deviations. If the sum of the deviations is above a secondary
threshold, indicating a trajectory with high curvature or multiple curvature
changes, then the agent casts a vote for animacy. Above this threshold, the
certainty is proportional to the sum of the deviations.
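The two-threshold logic can be sketched as follows. Both threshold values and the certainty scalings are hypothetical placeholders, velocities are approximated by frame-to-frame differences, and abstaining with `None` between the two thresholds is an assumption.

```python
import math


def velocities(points):
    """Frame-to-frame displacement vectors of an (x, y) trajectory."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


def straight_line_agent(points, low=5.0, high=50.0):
    """Vote inanimate for near-constant velocity, animate for erratic velocity."""
    v = velocities(points)
    if not v:
        return None
    mean_x = sum(vx for vx, _ in v) / len(v)
    mean_y = sum(vy for _, vy in v) / len(v)
    # Sum of deviations of each velocity from the average velocity vector.
    deviation = sum(math.hypot(vx - mean_x, vy - mean_y) for vx, vy in v)
    if deviation < low:
        return (-1.0, 1.0 - deviation / low)  # assumed certainty scaling
    if deviation > high:
        return (1.0, min(1.0, deviation / (2 * high)))  # assumed scaling
    return None  # assumed: no vote between the two thresholds
```

A point moving at constant velocity yields zero deviation and a confident inanimate vote, while a point that reverses direction every frame accumulates a large deviation and is voted animate.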

4.4  Acceleration  Sign  Change  Agent 

One proposal for finding animacy is to look for changes in the sign of the
acceleration. According to this proposal, anything that can alter the
direction of its acceleration must be operating under its own power (excluding
contact with other objects). The acceleration sign change agent looks for
zero-crossings in the acceleration profile of a trajectory. Anything with more
than one zero-crossing is given an animacy vote with a certainty proportional
to the number of zero-crossings.
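A sketch of the zero-crossing count is given below. It assumes a one-dimensional position sequence (for example, positions along one image axis), approximates acceleration by second differences, ignores near-zero values when determining sign, and uses a hypothetical scale factor for the certainty.

```python
def second_differences(xs):
    """Discrete acceleration estimate from a 1-D position sequence."""
    return [xs[i + 1] - 2 * xs[i] + xs[i - 1] for i in range(1, len(xs) - 1)]


def zero_crossings(values, eps=1e-6):
    """Count sign changes, skipping values within eps of zero."""
    signs = [1 if v > 0 else -1 for v in values if abs(v) > eps]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def acceleration_sign_change_agent(xs, scale=0.25):
    """Vote animate (+1.0) when the acceleration changes sign more than once."""
    crossings = zero_crossings(second_differences(xs))
    if crossings <= 1:
        return None  # assumed: the agent abstains
    return (1.0, min(1.0, scale * crossings))  # assumed certainty scaling
```

A uniformly accelerating point (positions 0, 1, 4, 9, ...) has no sign changes and draws no vote, whereas a point that oscillates back and forth produces a sign change at nearly every step.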


Figure  3:  Thirty  stimuli  used  in  the  evaluation  of  ToBY.  Stimuli  were  collected  by  recording  the  position  of  the  most  salient 
object  detected  by  the  attention  system  when  the  robot  observed  natural  scenes  similar  to  the  one  shown  in  Figure  2.  Each 
image  shown  here  is  the  collapsed  sequence  of  video  frames,  with  more  recent  points  being  brighter  than  older  points.  Human 
subjects  saw  only  a  single  bright  point  in  each  frame  of  the  video  sequence. 


4.5  Energy  Agent 

Bingham, Schmidt, and Rosenblum [1995] have proposed that human adults judge
animacy based on models of potential and kinetic energy. To explore their
hypothesis, a simple energy model agent was implemented. The energy model
agent judges an object that gains energy to be animate. The energy model
computes the total energy of the system E based on a simple model of kinetic
and potential energies:

$$E = \frac{1}{2} m v_y^2 + m g y \tag{8}$$

where $m$ is the mass of the object, $v_y$ the vertical velocity, $g$ the
gravity constant, and $y$ the vertical position in the image. In practice,
since the mass is a constant scale factor, it is not included in the
calculations. This simple model assumes that an object higher in the image is
further from the ground, and thus has more potential energy. The vertical
distance and velocity are measured using the gravity vector from a three-axis
inertial system as a guideline, allowing the robot to determine “up” even when
its head is tilted. The certainty of the vote is proportional to the measured
changes in energy.
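The energy computation, with the mass term dropped as described, might be sketched as follows. The inputs are assumed to be heights measured upward (so that larger values mean further from the ground); the values of `g`, `dt`, the gain threshold, and the certainty scale are all hypothetical, and abstaining when no energy is gained is an assumption.

```python
def energy_profile(heights, g=9.8, dt=1.0):
    """Per-step energy 0.5 * v_y**2 + g * y, with mass omitted."""
    profile = []
    for y0, y1 in zip(heights, heights[1:]):
        vy = (y1 - y0) / dt  # finite-difference vertical velocity
        profile.append(0.5 * vy * vy + g * y1)
    return profile


def energy_agent(heights, g=9.8, dt=1.0, gain_threshold=1.0, scale=0.01):
    """Vote animate (+1.0) when the trajectory shows a net energy gain."""
    profile = energy_profile(heights, g, dt)
    if len(profile) < 2:
        return None
    gain = profile[-1] - profile[0]
    if gain <= gain_threshold:
        return None  # assumed: the agent abstains without an energy gain
    return (1.0, min(1.0, scale * gain))  # certainty grows with the gain
```

An object that rises at constant speed gains potential energy at every step and is voted animate, while the same path traversed downward shows a net loss and draws no vote.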

5  Comparing  ToBY’s  Performance  to  Human  Performance

The performance of the individual agents was evaluated both on dynamic,
real-world scenes at interactive rates and on more carefully controlled
recorded video sequences.

For interactive video tasks, at each time step five attention targets were
produced. Trajectories were allowed to grow to a length of sixty frames, but
additional information on the long-term animacy scores for continuous
trajectories was maintained as described in section 4. All three voting
methods were tested. The winner-take-all and the weighted average voting
methods produced extremely similar results, and eventually the winner-take-all
strategy was employed for simplicity. The parameters of the ToBY module were
tuned to match human judgments on long sequences of simple data structures
(such as were produced by static objects or people moving back and forth
throughout the room).

5.1  Motion  Trajectory  Stimuli 

To further evaluate the individual ToBY agents on controlled data sequences,
video from the robot’s cameras was recorded and processed by the attention
system to produce only a single salient object in each frame.¹ To remove all
potential contextual cues, a new video sequence was created containing only a
single moving dot representing the path taken by that object set against a
black background, which in essence is the only data available to the ToBY
system. Thirty video segments of approximately 120 frames each were collected
(see Figure 3). These trajectories included static objects (e.g. #2), swinging
pendula (e.g. #3), objects that were thrown into the air (e.g. #7), as well as
more complicated trajectories (e.g. #1). Figure 4 shows the trajectories
grouped according to the category of movement, and can be matched to Figure 3
using the stimulus number in the second column. The third column of Figure 4
shows whether the stimulus was animate or inanimate.

¹ This restriction on the number of targets was imposed following pilot
experiments using multiple targets. Human subjects found the multiple target
displays more difficult to observe and comprehend. Because each agent
currently treats each trajectory independently, this restriction should not
bias the comparison.

5.2  Human  Animacy  Judgments 

Thirty-two adult volunteer subjects were recruited for this study. Subjects
ranged in age from 18 to 50, and included 14 women and 18 men. Subjects
participated in a web-based questionnaire and were informed that they would be
seeing video sequences containing only a single moving dot, and that this dot
represented the movement of a real object. They were asked to rate each of the
thirty trajectories shown in Figure 3 on a scale of 1 (animate) to 10
(inanimate). Following initial pilot subjects (not included in this data),
subjects were reminded that inanimate objects might still move (such as a
boulder rolling down a hill) but should still be treated as inanimate.
Subjects were allowed to review each video sequence as often as they liked,
and no time limit was used.

The task facing subjects was inherently under-constrained, and the animacy
judgments showed high variance (a typical variance for a single stimulus
across all subjects was 2.15). Subjects tended to find multiple
interpretations for a single stimulus, and there was never a case when all
subjects agreed on the animacy/inanimacy of a trajectory. To simplify the
analysis, and to remove some of the inter-subject variability, each response
was re-coded from the 1-10 scale to a single animate (1-5) or inanimate (6-10)
judgment. Subjects made an average of approximately 8 decisions that disagreed
with the ground truth values. This overall performance measurement of 73%
correct implies that the task is difficult, but not impossible. Column 4 of
Figure 4 shows the percentage of subjects who considered each stimulus to be
animate. In two cases (stimuli #13 and #9), the majority of human subjects
disagreed with the ground truth values. Stimulus #9 showed a dot moving
alternately up and down, repeating a cycle approximately every 300 msec.
Subjects reported seeing this movement as “too regular to be animate.”
Stimulus #13 may have been confusing to subjects in that it contained an
inanimate trajectory (a ball being thrown and falling) that was obviously
caused by an animate (but unseen) force.
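The re-coding step and the 73% figure can be checked with a few lines of arithmetic. The function names below are hypothetical; the numbers (thirty stimuli, roughly eight disagreements per subject) come from the text.

```python
def recode(rating):
    """Map a 1-10 rating to a binary label: 1-5 animate, 6-10 inanimate."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    return "animate" if rating <= 5 else "inanimate"


def percent_correct(n_stimuli, n_errors):
    """Percentage of judgments that agree with the ground truth."""
    return 100.0 * (n_stimuli - n_errors) / n_stimuli
```

With 30 stimuli and about 8 disagreements, `percent_correct(30, 8)` evaluates to roughly 73.3, consistent with the reported 73% correct.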

5.3  ToBY  Animacy  Judgments 

The identical video sequences shown to the human subjects were processed by
the trajectory formation system and the ToBY system. Trajectory lengths were
allowed to grow to 120 frames to take advantage of all of the information
available in each short video clip. A winner-take-all selection method was
imposed on the ToBY agents to simplify the reporting of the results, but
subsequent processing with both other voting methods produced identical
results. The final animacy judgment was determined by the winning agent on the
final time step. Columns 6 and 5 of Figure 4 show the winning agent and that
agent’s animacy vote respectively.

Overall, ToBY agreed with the ground truth values on 23 of the 30 stimuli, and
with the majority of human subjects on 21 of the 30 stimuli. On the static
object categories, the circular movement stimuli, and the straight line
movement stimuli, ToBY matched the ground truth values perfectly. However, the
system completely failed on all stimuli that had natural pendulum-like
movements. While our original predictions indicated that the energy agent
should be capable of dealing with this class of stimuli, human subjects seemed
to be responding more to the repetitive nature of the stimulus than to the
transfer between kinetic and potential energy. ToBY also failed on one of the
thrown objects (stimulus #20), which paused when it reached its apex, and on
one other object (stimulus #19) which had a failure in the trajectory
construction phase.

6  Conclusion 

The distinction between animate and inanimate is a fundamental classification
that humans as young as 6 months readily perform. Based on observations that
humans can perform these judgments based purely on spatio-temporal signatures,
this paper presented an implementation of a few simple naive rules for
identifying animate objects. Using only the impoverished stimuli from the
attentional system, and without any additional context, adults were quite
capable of classifying animate and inanimate stimuli. While the set of agents
explored in this paper is certainly insufficient to capture all classes of
stimuli, as the pendulum example illustrates, these five simple rules are
sufficient to explain a relatively broad class of motion profiles. These
simple algorithms (like the agents presented here) may provide a quick first
step, but do not begin to make the same kinds of contextual judgments that
humans use.

In the future, we intend to extend this analysis to include comparisons
against human performance for multi-target stimuli and for more complex object
interactions including elastic and inelastic collisions.
Figure 4: Comparison of human animacy judgments with judgments produced by
ToBY for each of the stimuli from Figure 3. Column 3 is the ground truth, that
is, whether the trajectory actually came from an animate or inanimate source.
Column 4 shows the percentage of human subjects who considered the stimulus to
be animate. Column 5 shows the animacy judgment of ToBY, and column 6 shows
the agent that contributed that decision. Bold items in the human or ToBY
judgment columns indicate a disagreement with the ground truth.